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ABSTRACT 

The  power  of  sparse  signal  coding  with  learned  dictionaries  has  been 
demonstrated  in  a  variety  of  applications  and  fields,  from  signal  pro¬ 
cessing  to  statistical  inference  and  machine  learning.  However,  the 
statistical  properties  of  these  models,  such  as  underfitting  or  overfit¬ 
ting  given  sets  of  data,  are  still  not  well  characterized  in  the  literature. 
This  work  aims  at  filling  this  gap  by  means  of  the  Minimum  De¬ 
scription  Length  (MDL)  principle  -  a  well  established  information- 
theoretic  approach  to  statistical  inference.  The  resulting  framework 
derives  a  family  of  efficient  sparse  coding  and  modeling  (dictio¬ 
nary  learning)  algorithms,  which  by  virtue  of  the  MDL  principle, 
are  completely  parameter  free.  Furthermore,  such  framework  al¬ 
lows  to  incorporate  additional  prior  information  in  the  model,  such 
as  Markovian  dependencies,  in  a  natural  way.  We  demonstrate  the 
performance  of  the  proposed  framework  with  results  for  image  de- 
noising  and  classification  tasks. 

Index  Terms —  Sparse  coding,  dictionary  learning,  MDL,  de- 
noising,  classification 

1.  INTRODUCTION 

Sparse  models  are  by  now  well  established  in  a  variety  of  fields  and 
applications,  including  signal  processing,  machine  learning,  and  sta¬ 
tistical  inference,  e.g.  [1,  2,  3]  and  references  therein. 

When  sparsity  is  a  modeling  device  and  not  a  hypothesis  about 
the  nature  of  the  analyzed  signals,  parameters  such  as  the  desired 
sparsity  in  the  solutions,  or  the  size  of  the  dictionaries  to  be  learned, 
play  a  critical  role  in  the  effectiveness  of  sparse  models  for  the  tasks 
at  hand.  However,  lacking  theoretical  guidelines  for  such  param¬ 
eters,  published  applications  based  on  learned  sparse  models  often 
rely  on  either  cross-validation  or  ad-hoc  methods  for  learning  such 
parameters  (an  exception  for  example  being  the  Bayesian  approach, 
e.g.,  [4])  .  Clearly,  such  techniques  can  be  impractical  and/or  inef¬ 
fective  in  many  cases. 

At  the  bottom  of  this  problem  lie  fundamental  questions  such 
as:  how  rich  or  complex  is  a  sparse  model?  how  does  this  depend  on 
the  required  sparsity  of  the  solutions,  or  the  size  of  the  dictionaries? 
what  is  the  best  model  for  a  given  data  class?  A  possible  objective 
answer  to  such  questions  is  provided  by  the  Minimum  Description 
Length  principle  (MDL)  [5],  a  general  methodology  for  assessing 
the  ability  of  statistical  models  to  capture  regularity  from  data.  The 
MDL  principle  is  often  regarded  as  a  practical  implementation  of  the 
Occam’s  razor  principle,  which  states  that,  given  two  descriptions 
for  a  given  phenomenon,  the  shorter  one  is  usually  the  best.  In  a 
nutshell,  MDL  equates  “ability  to  capture  regularity”  with  “ability  to 
compress”  the  data,  and  the  metric  with  which  models  are  measured 
in  MDL  is  codelength  or  compressibility. 


The  idea  of  using  MDL  for  sparse  signal  coding  was  explored  in 
the  context  of  wavelet-based  image  denoising  [6,  7].  These  pioneer¬ 
ing  works  were  restricted  to  denoising  using  fixed  orthonormal  basis 
(wavelets).  In  addition,  the  underlying  probabilistic  models  used 
to  describe  the  transform  coefficients,  which  are  the  main  technical 
choice  to  make  when  applying  MDL  to  a  problem,  were  not  well 
suited  to  the  actual  statistical  properties  of  the  modeled  data  (image 
encoding  coefficients),  thus  resulting  in  poor  performance.  Further¬ 
more,  these  works  did  not  consider  the  critical  effects  of  quantization 
in  the  coding,  which  needs  to  be  taken  into  account  when  working 
in  a  true  MDL  framework  (a  useful  model  needs  to  be  able  to  com¬ 
press,  that  is,  it  needs  to  produce  actual  codes  which  are  shorter  than 
a  trivial  description  of  the  data).  Finally,  these  models  are  designed 
for  noisy  data,  with  no  provisions  for  the  important  case  of  noiseless 
data  modeling,  which  addresses  the  errors  due  to  deviations  from  the 
model,  and  is  critical  in  many  applications  such  as  classification. 

The  framework  presented  in  this  work  addresses  all  of  the  above 
issues  in  a  principled  way:  i)  Efficient  codes  are  used  to  describe 
the  encoded  data;  ii)  Deviations  from  the  model  are  taken  into  ac¬ 
count  when  modeling  approximation  errors,  thus  being  more  general 
and  robust  than  traditional  sparse  models;  iii)  Probability  models  for 
(sparse)  transform  coefficients  are  corrected  to  take  into  account  the 
high  occurrence  of  zeros;  iv)  Quantization  is  included  in  the  model, 
and  its  effect  is  treated  rigorously;  v)  Dictionary  learning  is  formu¬ 
lated  in  a  way  which  is  consistent  with  the  model’s  statistical  as¬ 
sumptions.  At  the  theoretical  level,  this  brings  us  a  step  closer  to  the 
fundamental  understanding  of  sparse  models  and  brings  a  different 
perspective,  that  of  MDL,  into  the  sparse  world.  From  a  practical 
point  of  view,  the  resulting  framework  leads  to  coding  and  modeling 
algorithms  which  are  completely  parameter  free  and  computation¬ 
ally  efficient,  and  practically  effective  in  a  variety  of  applications. 

Another  important  attribute  of  the  proposed  family  of  models  is 
that  prior  information  can  be  easily  and  naturally  introduced  via  the 
underlying  probability  models  defining  the  codelengths.  The  effect 
of  such  priors  can  then  be  quickly  assessed  in  terms  of  the  new  code¬ 
lengths  obtained.  For  example,  Markovian  dependencies  between 
sparse  codes  of  adjacent  image  patches  can  be  easily  incorporated. 

2.  BACKGROUND  ON  SPARSE  MODELS 

Assume  we  are  given  n  ra-dimensional  data  samples  ordered  as 
columns  of  a  matrix  Y  =  [yi  |y2 1  •  •  •  \yn]  G  MmXn.  Consider  a 
linear  model  for  Y,  Y  =  DA  +  E,  where  D  =  [di  |d2 1  • . .  |dp]  is 
an  mxp  dictionary  consisting  of  p  atoms,  A  =  [ai |a2 1  . . .  |an]  E 
M.pXn  is  a  matrix  of  coefficients  where  each  j- th  column  a j  specifies 
the  linear  combination  of  columns  of  D  that  approximates  y j ,  and 
E  =  [ei  |e2|  .  • .  |en]  E  RmXn  is  a  matrix  of  approximation  errors. 
We  say  that  the  model  is  sparse  if,  for  all  or  most  j  —  1 , . . . ,  n,  we 
can  achieve  \\ej  ||2  ~  0  while  requiring  that  only  a  few  coefficients 
7  <C  p  in  a.j  can  be  nonzero. 

The  problem  of  sparsely  representing  Y  in  terms  of  D,  which 
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we  refer  to  as  the  sparse  coding  problem ,  can  be  written  as 

aj=argmin||a||0  s.t.  ||yj  — Da||2  <  e,  j  =  1, . . .  ,n,  (1) 

a 

where  ||a||0  =  7  is  the  pseudo-norm  that  counts  the  number  of 
nonzero  elements  in  a  vector  a,  and  e  is  some  small  constant.  There 
is  a  body  of  results  showing  that  the  problem  (1),  which  is  non- 
convex  and  NP-hard,  can  be  solved  exactly  when  certain  conditions 
on  D  and  A  are  met,  by  either  well  known  greedy  methods  such  as 
Matching  Pursuit  [8],  or  by  solving  a  convex  approximation  to  (1), 
commonly  referred  to  as  Basis  Pursuit  (BP)  [9], 

kj  =  argmin  Ha^  s.t.  ||yj  -  Da||2  <  e,  j  =  1, . . . ,  n.  (2) 

a 

When  D  is  included  as  an  optimization  variable,  we  refer  to  the  re¬ 
sulting  problem  as  sparse  modeling.  This  problem  is  often  written 
in  unconstrained  form, 

n  1 

(D,  A)  =  argmin ^  -  ||yj  -  Da^lj  +  A  ||s.j  ||j.,  (3) 

3  = 1 

where  A  >  0  is  an  arbitrary  constant.  The  problem  in  this  case  is 
non-convex  in  (D,  A),  and  one  must  be  contempt  with  finding  local 
minima.  Despite  this  drawback,  in  recent  years,  models  learned  by 
(approximately)  minimizing  (3)  have  shown  to  be  very  effective  for 
signal  analysis,  leading  to  state-of-the-art  results  in  several  applica¬ 
tions  such  as  image  restoration  and  classification. 

2.1.  Model  complexity  of  sparse  models 

In  sparse  modeling  problems  where  D  is  learned,  parameters  such  as 
the  desired  sparsity  7,  the  penalty  A  in  (3),  or  the  number  of  atoms 
p  in  D,  must  be  chosen  individually  for  each  application  and  type 
of  data  to  produce  good  results.  In  such  cases,  most  sparse  model¬ 
ing  techniques  end-up  using  cross-validation  or  ad-hoc  techniques  to 
select  these  critical  parameters.  An  alternative  formal  path  is  to  pos¬ 
tulate  a  Bayesian  model  where  these  parameters  are  assigned  prior 
distributions,  and  such  priors  are  adjusted  through  learning.  This  ap¬ 
proach,  followed  for  example  in  [4],  adds  robustness  to  the  modeling 
framework,  but  leaves  important  issues  unsolved,  such  as  providing 
objective  means  to  compare  different  models  (with  different  priors, 
for  example).  The  use  of  Bayesian  sparse  models  implies  having  to 
repeatedly  solve  possibly  costly  optimization  problems,  increasing 
the  computational  burden  of  the  applications. 

In  this  work  we  propose  to  use  the  MDL  principle  to  formally 
tackle  the  problem  of  sparse  model  selection.  The  goal  is  twofold: 
for  sparse  coding  with  fixed  dictionary,  we  want  MDL  to  tell  us  the 
set  of  coefficients  that  gives  us  the  shortest  description  of  a  given 
sample.  For  dictionary  learning,  we  want  to  obtain  the  dictionary 
which  gives  us  the  shortest  average  description  of  all  data  samples 
(or  a  representative  set  of  samples  from  some  class).  A  detailed  de¬ 
scription  of  such  models,  and  the  coding  and  modeling  algorithms 
derived  from  them,  is  the  subject  of  the  next  section. 

3.  MDL-BASED  SPARSE  CODING  AND  MODELING 
FRAMEWORK 

Sparse  models  break  the  input  data  Y  into  three  parts:  a  dictionary 
D,  a  set  of  coefficients  A,  and  a  matrix  of  reconstruction  errors.  In 
order  to  apply  MDL  to  a  sparse  model,  one  must  provide  codelength 
assignments  for  these  components,  L(A),  L(D)  and  L(E),  so  that 
the  total  codelength  L(Y)  =  L(E)  +  L(A)  +  L(D)  can  be  com¬ 
puted.  In  designing  such  models,  it  is  fundamental  to  incorporate  as 


much  prior  information  as  possible  so  that  no  cost  is  paid  in  learning 
already  known  statistical  features  of  the  data,  such  as  invariance  to 
certain  transformations  or  symmetries.  Another  feature  to  consider 
in  sparse  models  is  the  predominance  of  zeroes  in  A.  In  MDL,  all 
this  prior  information  is  embodied  in  the  probability  models  used 
for  encoding  each  component.  What  follows  is  a  description  of  such 
models. 

Sequential  coding-  In  the  proposed  framework,  Y  is  encoded  se¬ 
quentially,  one  column  yj  at  a  time,  for  j  =  1 ,  2, . . . ,  n,  possibly  us¬ 
ing  information  (including  dependencies)  from  previously  encoded 
columns.  However,  when  encoding  each  column  yj,  its  sparse  coef¬ 
ficients  a  j  are  modeled  as  and  IID  sequence  (of  fixed  length  p). 
Quantization-  To  achieve  true  compression,  the  finite  precision  of 
the  input  data  Y  must  be  taken  into  account.  In  the  case  of  digital 
images  for  example,  elements  from  Y  usually  take  only  256  possi¬ 
ble  values  (from  0  to  255  in  steps  of  size  5y  =  1).  Since  there  is 
no  need  to  encode  E  with  more  precision  than  Y,  we  set  the  error 
quantization  step  to  Se  =  Sy.  As  for  D  and  A,  the  corresponding 
quantization  steps  Sa  and  Sd  need  to  be  fine  enough  to  produce  fluc¬ 
tuations  on  DA  which  are  smaller  than  the  precision  of  Y,  but  not 
more.  Therefore,  the  actual  distributions  used  are  discretized  ver¬ 
sions  of  the  ones  discussed  below. 

Error-  The  elements  of  E  are  encoded  with  an  IID  model  where 
the  random  value  of  a  coefficient  (represented  by  an  r.v.  e)  is  the 
linear  superposition  of  two  effects:  e  =  e  +  77,  where  77  ~  J\f(0,  cry), 
cr2  assumed  known,  models  noise  in  Y  due  to  measurement  and/or 
quantization,  and  e  ~  £(0, 6e)  is  a  zero  mean,  heavy-tailed  (Lapla- 
cian)  error  due  to  the  model.  The  resulting  distribution  for  e,  which 
is  the  convolution  of  the  involved  Laplacian  and  Gaussian  distribu¬ 
tions,  was  developed  in  [10]  under  the  name  “LG.”  This  model  will 
be  referred  to  as  Pe  (• ;  cr2 ,  0e)  hereafter. 

Sparse  code-  Each  coefficient  in  A  is  modeled  as  the  product  of 
three  (non-independent)  random  variables  (see  also  [4]  for  a  related 
model),  a  =  +  S. a),  where  (  ~  Ber(p)  is  a  support  indi¬ 

cator,  that  is,  C  =  1  implies  a  ^  0,  0  =  sgn(a),  and  v  — 
max{|a|  —  Sa:  0}  is  the  absolute  value  of  a  corrected  for  the  fact 
that  v  >  8 a  when  £  *=  1.  Conditioned  on  C  —  0,  <f)  =  v  =  0  with 
probability  1.  Conditioned  on  £  =  1,  we  assume  <f>  ~  Ber(l/2), 
and  v  to  be  Exp (6U).  Note  that,  with  these  choices,  P(cj)v |£  =  1) 
is  a  Laplacian,  which  is  a  standard  model  for  transform  (e.g.,  DCT, 
Wavelet)  coefficients.  The  probability  models  for  the  variables  £  and 
v  will  be  denoted  as  P^(-;  p)  and  PU(-]0U)  respectively. 
Dictionary-  We  assume  the  elements  of  D  to  be  uniformly  dis¬ 
tributed  on  [—1,1].  Following  the  standard  MDL  recipe  for  en¬ 
coding  model  parameter  values  learned  from  n  samples,  we  use 
a  quantization  step  Sd  =  n-1//2  [5].  For  these  choices  we  have 
L(D)  =  mp  log2  n,  which  does  not  depend  on  the  element  values 
of  D  but  only  on  the  number  of  atoms  p  and  the  size  of  Y.  Other 
possible  models  which  impose  structure  in  D,  such  as  smoothness  in 
the  atoms,  are  natural  to  the  proposed  framework  and  will  be  treated 
in  the  extended  version  of  this  work. 

3.1.  Universal  models  for  unknown  parameters 

The  above  probability  models  for  the  error  e,  support  (  and  (shifted) 
coefficient  magnitude  v  depend  on  parameters  which  are  not  known 
in  advance.  In  contrast  with  Bayesian  approaches,  cross-validation, 
or  other  techniques  often  used  in  sparse  modeling,  modern  MDL 
solves  this  problem  efficiently  by  means  of  the  so  called  universal 
probability  models  [5].  In  a  nutshell,  universal  models  provide  op¬ 
timal  codelengths  using  probability  distributions  of  a  known  family, 
with  unknown  parameters,  thus  generalizing  the  classic  results  from 


Shannon  theory  [11]. 

Following  this,  we  substitute  Pq(--,  p),  Pe(-;  cr^,  0e)  and  PU(-]0V) 
with  corresponding  universal  models.  For  describing  a  given  support 
z,  we  use  an  enumerative  code  [12],  which  first  describes  the  size  of 
the  support  7  with  log2  p  bits,  and  then  the  particular  arrangement 
of  non-zeros  in  z  using  log2  (^)  bits.  Pe(*;  cr^Oe)  and  Pu(m]  0V)  are 
substituted  by  corresponding  universal  mixture  models,  one  of  the 
possibilities  dictated  by  the  theory  of  universal  modeling, 

r+ 00 

Qe(u;cr‘?j)  =  we{P)Pe{u\  07,  Q)dO, 

Jo 

/*+ 00 

Qv(u)  —  /  wu{0)Pu{u\6)d01 

Jo 

where  the  mixing  functions  we{6)  and  wa(0)  are  Gamma  dis¬ 
tributions  (the  conjugate  prior  for  the  exponential  distribution), 
w(0\k,/3)  =  T(k)~16k~1  /3Ke~^e ,  0  E  M+  with  fixed  parameters 
(fte,  /3e)  and  (ku,I3v)  respectively.  The  resulting  Mixture  of  Expo¬ 
nentials  (MOE)  distribution  Q  *,(•),  is  given  by  (see  [  ]  for  details), 

Qu(u\/3u,ku)  =  KuPvv{u  +  Pv)~^Kv+1\  u  E  R+.  Observing 
that  the  convolution  and  the  convex  mixture  operations  that  result 
in  Qe(s  cr^j)  are  interchangeable  (both  integrals  are  finite),  it  turns 
out  that  Qe(  ;  cr^)  is  a  convolution  of  a  MOE  (of  hyper-parameters 
(fte,/3e))  and  a  Gaussian  with  parameter  a^.  Thus,  although  the 
explicit  formula  for  this  distribution,  which  we  call  MOEG,  is  cum¬ 
bersome,  we  can  easily  combine  the  results  in  [10]  for  the  LG, 
and  for  MOE  in  [  3]  to  perform  tasks  such  as  parameter  estimation 
within  this  model.  Note  that  the  universality  of  these  mixture  models 
does  not  depend  on  the  values  of  the  hyper-parameters,  and  their 
choice  has  little  impact  on  their  overall  performance.  Here,  guided 
by  [13],  we  set  kv  —  Ke  —  3.0,  f3u  =  fde  =  Sa. 

Following  standard  practice  in  MDL,  the  ideal  Shannon  code  is 
used  to  translate  probabilities  into  codelengths.  Under  this  scheme,  a 
sample  value  u  with  probability  P(u)  is  assigned  a  code  with  length 
L(u)  =  —  log  P(u)  (this  is  an  ideal  code  because  it  only  specifies  a 
codelength,  not  a  specific  binary  code,  and  because  the  codelengths 
produced  can  have  a  fractional  number  of  bits).  Then,  for  a  given 
error  residual  e  and  coefficients  magnitude  vector  v  =  [max{  |  a*  |  — 
Sa,  0}]i=i,...,m,  the  respective  ideal  codelengths  will  be  Le(e)  = 
EI7-log2  Qe(ei)  and  L„(v)  =  J2l=izk=i~  loS2  Q*K)-  pi- 
nally,  following  the  assumed  Ber(l/2)  model,  the  sign  of  each  non¬ 
zero  element  in  a  is  encoded  using  1  bit,  for  a  total  of  7  bits. 

3.2.  MDL-based  sparse  coding  algorithms 

The  goal  of  a  coding  algorithm  in  this  framework  is  to  obtain,  for 
each  j-th  sample  y y,  a  vector  a j  which  minimizes  its  description, 

a j  =  arg min L(e, a)  =  Le(ej)+Lc(z)  +  L(s|z)  +  L^(v|z).  (4) 

a 

As  it  happens  with  most  model  selection  algorithms,  considering  the 
support  size  (7  =  1 1  a ^  1 1 0 )  explicitly  in  the  cost  function  results  in 
a  non-convex,  discontinuous  objective  function.  A  common  proce¬ 
dure  in  sparse  coding  for  this  case  is  to  estimate  the  optimum  support 

zj  for  each  possible  support  size,  7  =  1,  2 , _ ,p.  Then,  for  each 

optimum  support  {zj  :  7  =  1, . . .  ,p},  (4)  is  solved  in  terms  of  the 
corresponding  non-zero  values  of  a  j,  yielding  a  candidate  solution 
aj,  and  the  one  producing  the  smallest  codelength  is  assigned  to  a j. 
As  an  example  of  this  procedure,  we  propose  Algorithm  1 ,  which  is 
a  variant  of  Matching  Pursuit  [8].  As  in  [8],  we  start  with  a°  =  0, 
adding  one  new  atom  to  the  active  set  in  each  iteration.  However, 
instead  of  adding  the  atom  that  is  maximally  correlated  with  the  cur¬ 
rent  residual  e  to  the  active  set,  we  add  the  one  yielding  the  largest 


Algorithm  1 :  Codelength-based  Forward  Selection. 

Input:  Data  sample  y,  dictionary  D 

Output:  The  sparse  code  for  y,  a 

initialize  a  <—  0;  e  <—  y;  L  <—  1/(0);  z  0  ; 

initialize  g  DTe  ;  //correlation  of  current  error  with  the  dictionary 

repeat 

for  k  =  1 ,  .  .  .  ,  p  :  Zk  =  0  do 

Afc  [gk]sa  ;  //step  A/g  is  correlation,  quantized  to prec.  Sa 
Lfc  <—  L( a  +  A fcOJfc)  ;  //ook  is  the  k-th  canonical  vector  of  Rp 

end 

Choose  k  =  arg  mirifc:2;fc=o{I/fc }; 

if  Lp  <  L  then 

zk  <-  u 

a  «—  a  +  A  pujp  ; 

g  <-  g-  Ajdj  ; 

end 

until  Lp,  >  L  ; 


decrease  in  overall  codelength.  The  algorithm  stops  when  no  further 
decrease  in  codelength  is  obtained  by  adding  a  new  atom. 

An  alternative  for  estimating  {zj  :  7  =  1, ...  ,p}  is  to  use  a 
convex  model  selection  algorithm  such  as  LARS/Lasso  [14],  which 
also  begins  with  a°  =  0,  adding  one  atom  at  a  time  to  the  solution. 
For  this  case  we  propose  to  substitute  the  i 2  loss  in  LARS/Lasso 
by  —  logLG(-),  which  is  more  consistent  with  (4)  and  can  be  ap¬ 
proximated  by  the  Huber  loss  function  [15].  This  alternative  will  be 
discussed  in  detail  in  the  extended  version  of  this  work. 

3.3.  Dictionary  learning 

Dictionary  learning  in  the  proposed  framework  proceeds  in  two 
stages.  In  the  first  one,  a  maximum  dictionary  size  pmax  is  fixed 
and  the  algorithm  learns  D  using  alternate  minimization  in  A  and 
D,  as  in  standard  dictionary  learning  approaches,  now  with  the  new 
codelength-based  metric.  First,  D  is  fixed  and  A  is  updated  as 
described  in  Section  3.2.  Second,  keeping  A  fixed,  the  update  of 
D  reduces  to  argminn  Le( Y  —  DA)  (recall  that  L( D)  depends 
only  on  p,  thus  being  constant  at  this  stage).  We  add  the  constraint 
that  ||dfc ||2  <  1 ,  k  =s  1, ...  ,p,  a  standard  form  of  dictionary  regu¬ 
larization.  Here  too  we  approximate  Le(*)  with  the  Huber  function, 
obtaining  a  convex,  differentiable  dictionary  update  step  which  can 
be  efficiently  solved  using  scaled  projected  gradient.  In  practice,  this 
function  produces  smaller  codelengths  than  using  standard  £2  -based 
dictionary  update,  at  the  same  computational  cost  (see  Section  4). 

In  the  second  stage,  the  size  of  the  dictionary  is  optimized  by 
pruning  atoms  whose  presence  in  D  actually  results  in  an  increased 
average  codelength.  In  practice,  the  final  size  of  the  dictionary  re¬ 
flects  the  intuitive  complexity  of  the  data  to  be  modeled,  thus  vali¬ 
dating  the  approach  (see  examples  in  Section  4). 

4.  RESULTS  AND  CONCLUSION 

The  first  experiment  assesses  that  the  learned  models  produce  com¬ 
pressed  descriptions  of  natural  images.  For  this,  we  adapted  a  dic¬ 
tionary  to  the  Pascal’ 06  image  database1,  and  encoded  several  of  its 
images.  The  average  bits  per  pixel  obtained  was  4.08  bits  per  pixel 
(bpp),  with  p  —  250  atoms  in  the  final  dictionary.  We  repeated  this 
using  £2  instead  of  Huber  loss,  obtaining  4.12  bpp  and  p  =  245. 

We  now  show  example  results  obtained  with  our  framework  in 
two  very  different  applications.  In  both  cases  we  exploit  spatial  cor- 
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Fig.  1.  Denoising  results.  Left  to  right:  PSNR  of  denoised  images 
for  different  methods  and  noise  levels  (including  [2]  as  a  reference), 
sample  recovered  patch  from  Barbara,  best  dictionary  for  Barbara 
(p  =  200,  we  used  an  initial  pmax  =  512). 


Fig.  2.  Segmentation  results.  Left  to  right:  sample  “Nat-5c” 
taken  from  http://www.ux.uis.no/~tranden/data.html,  obtained 
segmentation  (error  rate  7.5%),  learned  dictionaries  for  classes  1 
(pi  =  210)  and  2  (p2  =  201). 


relation  between  codes  of  adjacent  patches  by  learning  Markovian 
dependencies  between  their  supports  (see  [16]  for  related  work  in 
the  Bayesian  framework).  More  specifically,  we  condition  the  prob¬ 
ability  of  occurrence  of  an  atom  at  a  given  position  in  the  image,  on 
the  occurrence  of  that  same  atom  in  the  left,  top,  and  top-left  patches. 
Note  that,  in  both  applications,  the  results  were  obtained  without  the 
need  to  adjust  any  parameter  (although  Sa  is  a  parameter  of  Algo¬ 
rithm  1,  it  was  fixed  beforehand  to  a  value  that  did  not  introduce 
noticeable  additional  distortion  in  the  model,  and  left  untouched  for 
the  experiments  shown  here). 

The  first  task  is  to  estimate  a  clean  image  from  an  observed  noisy 
version  corrupted  by  Gaussian  noise  of  known  variance  a^.  Here  Y 
contains  all  (overlapping)  8x8  patches  from  the  noisy  image.  First, 
a  dictionary  D  is  learned  from  the  noisy  patches.  Then  each  patch  is 
encoded  using  D  with  a  denoising  variant  of  Algorithm  1 ,  where  the 
stopping  criterion  is  changed  for  the  requirement  that  the  observed 
distortion  falls  within  a  ball  of  radius  y/mcr^ .  Finally,  the  estimated 
patches  are  overlapped  again  and  averaged  to  form  the  denoised  im¬ 
age.  From  the  results  shown  in  Figure  1,  it  can  be  observed  that 
adding  a  Markovian  dependency  between  patches  consistently  im¬ 
proves  the  results.  Note  also  that  in  contrast  with  [2],  these  results 
are  fully  parameter  free,  while  those  in  [2]  significantly  depend  on 
carefully  tuned  parameters. 

The  second  application  is  texture  segmentation  via  patch  clas¬ 
sification.  Here  we  are  given  c  images  with  sample  textures,  and  a 
target  mosaic  of  textures,  and  the  task  is  to  assign  each  pixel  in  the 
mosaic  to  one  of  the  textures.  Again,  all  images  are  decomposed 
into  overlapping  patches.  This  time  a  dictionary  Dr  is  learned  for 
each  texture  r  m  1, . . . ,  c  using  the  patches  from  the  training  im¬ 
ages.  Then,  each  patch  in  the  mosaic  is  encoded  using  all  available 
dictionaries,  and  its  center  pixel  is  assigned  to  the  class  which  pro¬ 
duced  the  shortest  description  length  for  that  patch.  The  final  result 
includes  a  simple  3x3  median  filter  to  smooth  the  segmentation. 
We  show  a  sample  result  in  Figure  2.  Here,  again,  the  whole  process 
is  parameter  free,  and  adding  Markovian  dependency  improves  the 
overall  error  rate  (7.5%  against  8.5%). 

In  summary,  we  have  presented  an  MDL-based  sparse  modeling 
framework,  which  automatically  adapts  to  the  inherent  complexity  of 
the  data  at  hand  using  codelength  as  a  metric.  As  a  result,  the  frame¬ 
work  can  be  applied  out-of-the-box  to  very  different  applications, 
obtaining  competitive  results  in  all  the  cases  presented.  We  have 
also  shown  how  prior  information,  such  as  spatial  dependencies,  are 
easily  added  to  the  framework  by  means  of  probability  models. 
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